DOCUMENT RESUME

ED 191 915                                        TM 800 539

AUTHOR          George, Archie A.
TITLE           Theoretical and Practical Consequences of the Use of
                Standardized Residuals as Rasch Model Fit Statistics.
INSTITUTION     Texas Univ., Austin. Research and Development Center
                for Teacher Education.
SPONS AGENCY    National Inst. of Education (DHEW), Washington, D.C.
PUB DATE        Apr 79
NOTE            56p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (63rd, San
                Francisco, CA, April 1979).

EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     Academic Ability; Computer Programs; Difficulty Level;
                *Goodness of Fit; *Item Analysis; Item Banks;
                *Latent Trait Theory; *Mathematical Models; Test Items
IDENTIFIERS     Item Discrimination (Tests); *Rasch Model;
                *Standardized Residuals

ABSTRACT
     The appropriateness of the use of the standardized residual (SR)
to assess congruence between sample test item responses and the one
parameter latent trait (Rasch) item characteristic curve is
investigated. Latent trait theory is reviewed, as well as theory of the
SR, the apparent error in calculating the expected distribution of the
SR, and implications of using the SR for item analysis. Empirical
results using actual data are presented to support the theoretical
analysis, as well as a demonstration of the practical implications of
the failure to reject items which do not fit the model. Conclusions
based on the findings include: (1) discriminations of all the items in
a test must be very similar in order for Rasch model analyses to work
in practice; (2) the SR mean square fit statistic does not detect
unacceptable variation in discrimination; and (3) item discrimination
needs to be monitored and controlled using more exact tests of fit than
the residuals mean square. Finally, an alternate linear model is
described which may provide a practical solution to problems
encountered in the construction of item banks and tailored testing. (Hi)

Reproductions supplied by EDRS are the best that can be made
from the original document.



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL
INSTITUTE OF EDUCATION POSITION OR POLICY.



THEORETICAL AND PRACTICAL CONSEQUENCES
OF THE USE OF STANDARDIZED RESIDUALS
AS RASCH MODEL FIT STATISTICS



Archie A. George 



"PERMISSION To REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)." 



Procedures for Adopting Educational Innovations Program
Research and Development Center for Teacher Education
The University of Texas at Austin



Paper presented at the annual meeting of the
American Educational Research Association,
San Francisco, April 9, 1979, Session 6.17






THEORETICAL AND PRACTICAL CONSEQUENCES OF THE USE OF
STANDARDIZED RESIDUALS AS RASCH MODEL FIT STATISTICS

Archie A. George

Research and Development Center for Teacher Education
The University of Texas at Austin

The standardized residual is a statistic which has been used to assess
the congruence between a sample of test item responses and the one parameter
latent trait (Rasch) item characteristic curve. However, the central thesis
of this paper is that the statistic is not appropriate for this purpose. Its
theoretical distribution is based on the central limit theorem which, of
course, requires a large sample size, and yet its calculation involves very
small sample sizes. The practical consequences of the use of this statistic
are described and illustrated.

More specifically, this paper consists of five main sections. The first
section contains a very brief review of latent trait theory and the models
receiving greatest attention at this time. The second section contains a
description of the standardized residual statistic and an explanation of its
theoretical basis. The apparent error in the calculation of the expected
distribution of this statistic is pointed out, and the implications of the use
of the statistic for item analysis are described and justified on a theoretical
basis. In the third section, empirical results using actual data are presented
which support the theoretical analysis. In the fourth section, the practical



The research described herein was conducted under contract with the
National Institute of Education. The opinions expressed are those of the
author and do not necessarily reflect the position or policy of the National
Institute of Education, and no endorsement by the National Institute of
Education should be inferred.






implications of the failure to reject items which do not fit the model are
demonstrated. Finally, an alternate model is described which may provide a
practical solution to problems encountered in the construction of item banks
and tailored testing.

Part I. Latent Trait Theory

Latent trait theory has been proposed as an alternative to classical test
theory for the assessment of ability and educational achievement. A latent
trait model specifies a relationship between observable performance and
unobservable traits (abilities) which are assumed to underlie performance. The
latent trait models currently under study in educational measurement specify
mathematical formulae which relate ability to probability of a correct response
on specific test items. Figure 1 shows a hypothetical relationship between
ability and correct response, which is called the item characteristic curve.



[Figure 1. A hypothetical item characteristic curve. Vertical axis: P(c),
probability of a correct response. Horizontal axis: ability, low to high.]

Lord (1952, 1953) investigated the use of the normal ogive as a model for
performance on mental test items. The mathematical complexity of this model
did not encourage its full development, but the groundwork was laid for later
work using simpler models. Currently, three latent trait models seem to be
receiving the most intense research. These models are very closely related
mathematically, even though they are the result of different lines of development.



The simplest is the Rasch model:

$$P(c) = \frac{\lambda}{1 + \lambda},$$

where $\lambda = e^{\beta - \delta}$.

In this model, $\delta$ is the difficulty of the item and $\beta$ is the ability of
the person. This model is often referred to as the "one parameter" model,
because the item characteristic curve is completely determined by only one
parameter, the difficulty of the item.

A model of intermediate complexity has been proposed:

$$P(c) = \frac{\lambda}{1 + \lambda},$$

where $\lambda = e^{\alpha(\beta - \delta)}$.

This model is the same as the Rasch model, except for the parameter $\alpha$,
which is called the "item discrimination" parameter. This parameter describes
the slope of the item characteristic curve. Thus, the two parameter model,
as this model is frequently referred to, has more flexibility. That is, a
wider variety of item characteristic curves can be described using this model.

A three parameter model has also been proposed:

$$P(c) = \gamma + (1 - \gamma)\,\frac{\lambda}{1 + \lambda},$$

where $\lambda = e^{\alpha(\beta - \delta)}$.

In this model, $\gamma$ is referred to as a "guessing" parameter. It functions
to modify the lower asymptote of the item characteristic curve so that the
probability of a correct response never reaches zero, no matter how low the
ability of the test taker.
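
The three logistic models can be written as one function whose extra parameters default to the Rasch case. The following is only an illustrative sketch: the function name and argument names are assumptions introduced here, not notation from the paper.

```python
import math


def item_characteristic_curve(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response under the logistic latent trait models.

    discrimination=1 and guessing=0 give the one parameter (Rasch) model,
    guessing=0 alone gives the two parameter model, and a nonzero guessing
    value gives the three parameter model.
    """
    lam = math.exp(discrimination * (ability - difficulty))
    return guessing + (1.0 - guessing) * lam / (1.0 + lam)


if __name__ == "__main__":
    # Compare the three models for one item of difficulty 0.
    for ability in (-2.0, 0.0, 2.0):
        rasch = item_characteristic_curve(ability, 0.0)
        two_par = item_characteristic_curve(ability, 0.0, discrimination=2.0)
        three_par = item_characteristic_curve(ability, 0.0, discrimination=2.0, guessing=0.2)
        print(ability, round(rasch, 3), round(two_par, 3), round(three_par, 3))
```

Setting the defaults this way makes the nesting of the models explicit: each simpler model is the more general one with a parameter held fixed.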

The Rasch model is currently receiving a lot of attention from test users,
school districts, and others. This is primarily due to the efforts of Ben
Wright and his colleagues at the University of Chicago. The simplicity of the
one parameter model has enabled statisticians to develop techniques of
estimating the item difficulty and person ability. What could be described as
cookbook procedures have been developed for the use of the Rasch model in many
testing situations (Cohen, 1976, in Wright, 1977). Estimation problems are
more difficult for the two and three parameter models.

One advantage offered by use of latent trait models is that different
tests can be used to measure students of different ability while maintaining
comparability of scores. That is, each student could receive a different test
and the abilities of the students could still be placed on a single scale. Low
ability students can be tested using easy tests and high ability students tested
using more difficult tests, and the measurements reported on a single scale.

The Rasch model offers one advantage over the other latent trait models:
the number of items correct on a test is all that is necessary to estimate a
person's ability. Correct responses to difficult items do not count any more
than correct responses to easy items. The Rasch model has been shown to be
the only logistic latent trait model which has this property (Andersen, 1977).
According to the other logistic models, a person's ability is estimated as a
function of the difficulty of the items that person marked correctly, not merely
how many.

Part II. A Theoretical Analysis of the Standardized 
Residual Fit Statistic 

According to Wright (1977), the two major advantages of the Rasch model
are sample-free item calibration and test-free person measurement. Sample-free
item calibration refers to the concept that the difficulty of test items can be
estimated regardless of the abilities of the persons who respond to the items.
Test-free person measurement refers to the concept that the ability of persons
can be estimated regardless of the particular items in the test to which they
respond. Together, these properties allow for the construction of a
personalized testing program which measures high ability students using tests
containing difficult items and low ability students using tests containing easy
items. Scores on all tests can be converted (vertically equated) to measures
of ability on a common scale, and new test items can be calibrated without
controlling for the ability of the sample.
Several investigators have examined the ability of the Rasch model to live
up to these promises. Anderson, Kearney and Everett (1968) and Tinsley and
Dawis (1975) found Rasch item difficulties to be fairly invariant for particular
sets of items when based on different samples of test takers. Whitely and Dawis
(1974) and Slinde and Linn (1978) attempted to replicate Wright's (1968)
results for the problem of vertical equating. According to Slinde and Linn,
Whitely and Dawis' results were not as good as those of Wright, but were judged
to "lend some support for the item-free person measurement claim of the Rasch
model" (1978, p. 26). However, Slinde and Linn found that for the math data
analyzed in their study, "the Rasch model did not provide a satisfactory means
of vertical equating" (1978, p. 34). They went on to say that it may be
necessary to more carefully select items that fit the model. [This is also the
recommendation of Keats and Boldt, as reported by Angoff (1971, pp. 529-530).]
Slinde and Linn did not test items for fit to the Rasch model. They acknowledged
that this may have been responsible for the inadequate vertical equating results
they obtained.

While working with the Rasch model, Dr. Donald Veldman and I noted that
the test of item fit recommended by Wright (1977, p. 102) indicated that a
very large number of items fit the Rasch model better than could be expected
by chance. That is, the standardized residual fit statistics were very low
for many items. I found this to be true in published articles, also (Perline,
Wright and Wainer, 1977; Wright and Mead, 1977), and initiated this study of
the fit statistic.

The standardized residual statistic is computed as follows. For each
person who attempts an item, a z-square is computed:

$$z^2 = \frac{(X - P)^2}{P(1 - P)},$$

where $P = e^x/(1 + e^x)$, $x = b - d$,
and $X = 1$ if correct response,
        $0$ if incorrect,
where $b$ = estimate of the person's ability
and $d$ = estimate of the item's difficulty.

This $z^2$ conveniently reduces to $e^{-x}$ for a correct response and $e^{x}$ for an
incorrect response. Clearly, when $x$ is large (i.e., a person's ability is
much greater than the item's difficulty), the predicted probability of a
correct response is very high. For instance, if $x = 4$,

$$P(c) = e^{4}/(1 + e^{4}) = 54.60/55.60 = .98.$$

If the person gets the item correct, a $z^2$ of $e^{-4}$ (.02) is assigned. If the
person misses the item, a $z^2$ of $e^{4}$ (54.6) is assigned. These $z^2$ are summed
over persons and divided by the number of persons to obtain the value of the
standardized residual. This statistic has also been called the mean squared
error, or fit mean square. Wright, Mead and Draba (1976) claim that the
statistic will be high for items with both high and low discriminations. This
statistic is supposedly distributed as a chi square with N-1 degrees of freedom
(Panchapakesan, 1969; Wright and Panchapakesan, 1969), or as a mean square
with expected value 1.0 and variance $2L/[(L-1)(N-1)]$, L = number of items, N =
number of persons (Perline, Wright and Wainer, 1977). Wright and his colleagues
have recommended several modifications of this simple $\sum z^2/N$ formula. These
adjustments were made because the distribution of the statistic does not
conform to theoretical expectation. However, there are fundamental problems
with the statistic which cannot be corrected by such adjustment.
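
The computation just described is short enough to check directly. The sketch below is illustrative only; the function names are assumptions introduced here. It evaluates the squared standardized residual from its definition, so the reduction to $e^{-x}$ and $e^{x}$ can be confirmed numerically rather than taken on trust.

```python
import math


def rasch_probability(x):
    """P = e^x / (1 + e^x), where x = b - d (ability minus difficulty)."""
    return math.exp(x) / (1.0 + math.exp(x))


def squared_standardized_residual(response, x):
    """z^2 = (X - P)^2 / (P (1 - P)) for one person-item encounter.

    response is 1 for a correct answer and 0 for an incorrect answer.
    """
    p = rasch_probability(x)
    return (response - p) ** 2 / (p * (1.0 - p))


def mean_squared_residual(responses, xs):
    """Average z^2 over persons for one item (the simple sum z^2 / N form)."""
    total = sum(squared_standardized_residual(r, x) for r, x in zip(responses, xs))
    return total / len(responses)


if __name__ == "__main__":
    x = 4.0
    print(round(rasch_probability(x), 2))
    print(round(squared_standardized_residual(1, x), 2), round(math.exp(-x), 2))
    print(round(squared_standardized_residual(0, x), 1), round(math.exp(x), 1))
```

The asymmetry is visible at once: with x = 4, a single unexpected miss contributes roughly three thousand times as much to the sum as an expected success.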

The basic idea in the formulation of the mean $z^2$ is that deviations of
obtained scores from expected values can be converted to z-scores by dividing
the deviations by the standard deviation, squaring, and averaging across
persons. Lower mean $z^2$ should indicate small deviation, in general, while large
mean $z^2$ should indicate greater deviations. Table 1 shows how heavily
dependent the standard deviation of the binomial distribution is upon the sample
size for several values of the probability of a correct response. In computing
the $z^2$ fit statistic, the sample size is always 1, since it is calculated for
each person-item encounter and averaged over persons to assess item fit, and
over items to assess person fit. From this point of view, the test should be
very conservative (i.e., rarely reject items), because the standard deviations
are large, which makes $z^2$ values small.
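
The dependence of the standard deviation on sample size can be illustrated with a few lines of code. This is a sketch under stated assumptions: it uses the standard deviation of an observed proportion, sqrt(P(1-P)/N), and the sample sizes shown are chosen here for illustration rather than taken from the paper.

```python
import math


def sd_of_proportion(p, n):
    """Standard deviation of the observed proportion correct in n binomial trials."""
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Illustrative sample sizes (an assumption, not the paper's column headings).
    sample_sizes = (1, 5, 10)
    for p in (0.50, 0.80, 0.90, 0.99):
        print(p, [round(sd_of_proportion(p, n), 2) for n in sample_sizes])
```

With a sample size of 1, the standard deviation is at its largest for every P(c), which is the situation in each person-item encounter.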



A more serious problem is the use of a normal approximation to the
binomial, which is implicit in the expectation that the sum or mean $z^2$ is chi
squared distributed. Table 2 shows the magnitude of errors which are
introduced when the $z^2$ fit statistic is assumed to be sampled from a normal
distribution. Table 2 shows several actual probabilities of a correct response
to an item [P(c)], and the probabilities which would be inferred to exist if
the $z$ (the square root of $z^2$) were a normal deviate.



When the actual probability is .50, the inferred probability for both
correct and incorrect responses (1 and 0) is .32. When the actual probability
of a correct response is .80, a correct response is inferred to have a
probability of .61, and an incorrect response a probability of .05. These
errors are introduced because the calculated $z$ is assumed to have been sampled
from a normal distribution, which would be appropriate only for large sample
sizes.

Table 1.
SD of binomial distribution, by P(c) and sample size (N).
Entries shown as "--" are illegible in this copy.

                       Sample Size (N)
  P(c)        1         5        10     [illegible]
  .50        --        --        --        --
  .60       .49        --        --        --
  .70        --        --        --        --
  .80        --       .18       .13       .07
  .90       .30       .13       .09       .05
  .95       .22       .10       .07       .04
  .98       .14       .05        --       .03
  .99       .10       .04       .03       .02

Table 2.
Actual and inferred probabilities assuming the
standardized residual is a z-score sampled from a normal distribution.

              Inferred Probability     Inferred Probability
  P(c)        of Correct Response      of Incorrect Response
  .50                .32                      .32
  .60                .41                      .22
  .70                .52                      .13
  .80                .61                      .05
  .90                .74                      .0027
  .95                .82                      .0000
  .98                .89                      .0000
  .99                .91                      .0000

The values in Table 2 were calculated as follows. Since
$P(c) = e^x/(1 + e^x)$, it follows that $e^x = P(c)/[1 - P(c)]$
for any specified P(c).

The $z^2$ for a correct answer is $e^{-x}$, and for an incorrect answer $e^{x}$,
according to the formula presented earlier:

$$z^2 = \frac{[X - P(c)]^2}{P(c)\,[1 - P(c)]},$$

which reduces algebraically to $e^{-x}$ when X = 1 and $e^{x}$ when X = 0. The
square root of each $z^2$ (i.e., $e^{-x/2}$, $e^{x/2}$) was calculated and the
corresponding cumulative probability, written here as $\Phi(z)$, found in a normal
probability table. The inferred probabilities in Table 2 are $2[1 - \Phi(z)]$,
which represents the probability of obtaining a value as deviant or more deviant
than one with the z score with which the table was entered.
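
The table lookup described above can be reproduced with the error function in place of a printed normal table. This is a minimal sketch; the helper names are assumptions introduced here, and the normal CDF is computed from `math.erf` rather than read from a table as in the paper.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inferred_probabilities(p_correct):
    """Two-tailed normal probabilities implied by treating z as a normal deviate.

    Returns (inferred probability of a correct response,
             inferred probability of an incorrect response).
    """
    odds = p_correct / (1.0 - p_correct)   # e^x
    z_correct = math.sqrt(1.0 / odds)      # square root of e^-x
    z_incorrect = math.sqrt(odds)          # square root of e^x
    two_tailed = lambda z: 2.0 * (1.0 - normal_cdf(z))
    return two_tailed(z_correct), two_tailed(z_incorrect)


if __name__ == "__main__":
    for p in (0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.98, 0.99):
        correct, incorrect = inferred_probabilities(p)
        print(p, round(correct, 2), round(incorrect, 4))
```

Small differences in the last digit relative to the printed table are to be expected, since a printed table is entered with a rounded z.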

With a little reflection, these considerations reveal how very small fit
statistics come about for some items. Items which are easy are answered
correctly more often than the Rasch model predicts. Each time this happens, a
very small $z^2$ is added to the "sum of squared residuals," and the result is a
small mean squared residual, which supposedly reflects good fit to the Rasch
curve! It is also possible to see how greatly this statistic can be affected
by a few students of low ability obtaining correct responses, perhaps by
guessing. For each unexpected correct answer, a very large $z^2$ is added to the sum
of squared residuals, producing a larger mean squared residual. Table 3 shows
the $z^2$ values for correct and incorrect responses at several values of P(c).

Table 3.
Values of the squared standardized residual for
several probabilities of a correct response.

  P(c)     z^2 for Correct Response     z^2 for Incorrect Response
  .50              1.00                          1.00
  .60               .67                          1.50
  .70               .43                          2.33
  .80               .25                          4.00
  .90               .11                          9.00
  .95               .05                         19.00
  .98               .02                         49.00
  .99               .01                         99.00

Notice the extremely high and low values of $z^2$ for values of P(c) above .90.

A more direct illustration might be to consider the implications of the
statistic for a number of persons of similar ability whose proportion of
correct responses deviates from the theoretical proportion in certain ways. For
example, if 40 persons obtained a total score on a test to which the Rasch
model assigns an ability rating of 1.386, the model predicts that 80 percent of
these should get an item with difficulty 0.0 correct. If exactly 32 (80%) do
indeed perform as predicted, the sum of squared residuals would be:

$$\sum z^2 = 32\,(e^{-1.39}) + 8\,(e^{1.39}) = 40.00$$

and the mean squared residual = 1.00. There is, of course, no difference
between obtained and expected proportion for this sample.

Let us suppose that 36 of the 40 individuals (90%) provided correct
responses:

$$\sum z^2 = 36\,(.250) + 4\,(4.00) = 25.00,$$

and the mean squared residual = .625.
Thus, it can be seen that a deviation in this direction might lead one to
infer that the data fit the model "better than" when no deviation was present!

On the other hand, let us suppose that 28 of the 40 individuals (70%)
provided correct responses:

$$\sum z^2 = 28\,(.250) + 12\,(4.00) = 55.00,$$

and the mean squared residual = 1.375.
In this case, a person would be led to the inference that the data did not fit
the model as well as the previous data, even though exactly the same deviation
is present.
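
The three cases for the group of 40 persons can be recomputed in a few lines. This is a sketch only; the function name and its arguments are assumptions introduced here, and every person is taken to have the same ability, as in the example.

```python
import math


def mean_squared_residual(n_correct, n_total, x):
    """Mean z^2 for n_total persons who all have the same x = b - d.

    Each correct response contributes e^-x and each incorrect response e^x.
    """
    total = n_correct * math.exp(-x) + (n_total - n_correct) * math.exp(x)
    return total / n_total


if __name__ == "__main__":
    x = math.log(0.8 / 0.2)  # about 1.386, so that P(c) = .80
    for n_correct in (32, 36, 28):
        print(n_correct, round(mean_squared_residual(n_correct, 40, x), 3))
```

The equal and opposite departures from the expected 32 correct give mean squares of .625 and 1.375, which is the asymmetry the example is meant to show.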

It might be appropriate, since there is a fairly large group involved, to
use the normal approximation to the binomial to test for significance of the
differences between obtained and expected frequencies of correct responses.
That is, for each case:

$$1)\quad z = \frac{.8 - .8}{\sqrt{(.8)(.2)/40}} = 0$$

$$2)\quad z = \frac{.9 - .8}{\sqrt{(.8)(.2)/40}} = 1.58$$

$$3)\quad z = \frac{.7 - .8}{\sqrt{(.8)(.2)/40}} = -1.58$$

What a difference grouping of scores makes!
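
The grouped test is the ordinary normal approximation to the binomial applied to the whole group of 40 at once. A minimal sketch follows, with a function name chosen here for illustration.

```python
import math


def grouped_z(n_correct, n_total, p_expected):
    """z for an observed proportion against the proportion the model predicts."""
    observed = n_correct / n_total
    standard_error = math.sqrt(p_expected * (1.0 - p_expected) / n_total)
    return (observed - p_expected) / standard_error


if __name__ == "__main__":
    for n_correct in (32, 36, 28):
        print(n_correct, round(grouped_z(n_correct, 40, 0.8), 2))
```

Unlike the mean squared residual, this statistic is symmetric: 36 correct and 28 correct are equally far from expectation, with opposite signs.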

Items which are very difficult for the sample of students on which fit
statistics are based often appear to fit the Rasch curve poorly because a few
students do answer these correctly. Items which are very easy are usually
inferred to fit the Rasch curve very well, because even more students answer
them correctly than are predicted by the model. Items which have difficulties
near the ability level of the sample usually appear to be a good fit because
of another factor: the Rasch curve is flatter than most actual item
characteristic curves. This results in more incorrect responses than predicted
when P(c) is less than .50 and more correct responses when P(c) is greater than
.50. Most $z^2$ are less than 1.00; so is the mean $z^2$. Figure 2 shows three item
characteristic curves: the Rasch curve, one steeper curve and one flatter
curve. The steeper and flatter curves have exactly the same deviation from
the Rasch. The mean squared deviations were calculated for each curve and
are shown in Table 4. These values indicate that the steeper curve fits the
model best, even better than the Rasch curve itself! The flatter curve fits
the least well. It is for this reason, in my opinion, that the mean squared
residual has become a widely used index of fit of data to the Rasch model.
The steeper the item characteristic curve, the better the item is inferred to
fit the Rasch model. Thus, tests built using the mean squared residual have,
for the most part, been as good as tests built using traditional test statistics.
However, such tests are only believed to have been constructed by selecting items
which fit the Rasch model. The selected items actually were those with the
highest discrimination, just as in traditional analyses.

[Figure 2. Three item characteristic curves: the Rasch, one slightly steeper,
and one slightly flatter.]

Table 4.
Calculation of mean squared residual for three curves in Figure 2.

             Correct   Incorrect     Steeper      Rasch      Flatter
    x        e^{-x}      e^{x}       Curve*       Curve*     Curve*
  -3.0       20.09        .05         0/100        5/95       10/90
  -2.5       12.18        .08         3/97         8/92       13/87
  -2.0        7.39        .14         8/92        12/88       16/84
  -1.5        4.48        .22        15/85        18/82       21/79
  -1.0        2.72        .37        25/75        27/73       29/71
   -.5        1.65        .61        37/63        38/62       39/61
   0.0        1.00       1.00        50/50        50/50       50/50
    .5         .61       1.65        63/37        62/38       61/39
   1.0         .37       2.72        75/25        73/27       71/29
   1.5         .22       4.48        85/15        82/18       79/21
   2.0         .14       7.39        92/8         88/12       84/16
   2.5         .08      12.18        97/3         92/8        87/13
   3.0         .05      20.09       100/0         95/5        90/10

  Sum                               855.56      1316.55     1737.74

  Mean Squared Residual                .66         1.01        1.34

*These ratios show the proportion correct to incorrect. To calculate the
mean squared residual, multiply the numerator by the value in the e^{-x}
column, multiply the denominator by the value in the e^{x} column and add
these two values; then sum this result across the 13 intervals shown
and divide by 1300.
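
The calculation described in the note to Table 4 can be sketched as follows. The counts of correct responses per 100 persons are the ratios as they read in this copy of the table, and the exponentials are computed exactly rather than rounded to two decimals, so the results need not match the printed sums digit for digit. The names used are assumptions introduced here.

```python
import math

# Ability minus difficulty at the 13 intervals of Table 4.
XS = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

# Number correct per 100 persons at each interval, as read from Table 4.
CORRECT_PER_100 = {
    "steeper": [0, 3, 8, 15, 25, 37, 50, 63, 75, 85, 92, 97, 100],
    "rasch":   [5, 8, 12, 18, 27, 38, 50, 62, 73, 82, 88, 92, 95],
    "flatter": [10, 13, 16, 21, 29, 39, 50, 61, 71, 79, 84, 87, 90],
}


def mean_squared_residual(correct_counts, xs=XS, group_size=100):
    """Sum n_correct * e^-x + n_incorrect * e^x over intervals, then average."""
    total = 0.0
    for x, n_correct in zip(xs, correct_counts):
        total += n_correct * math.exp(-x) + (group_size - n_correct) * math.exp(x)
    return total / (group_size * len(xs))


if __name__ == "__main__":
    for name, counts in CORRECT_PER_100.items():
        print(name, round(mean_squared_residual(counts), 2))
```

The ordering is the point of the exercise: the steeper curve yields the smallest mean square, the data that follow the Rasch curve yield a value near 1, and the flatter curve yields the largest.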



Part III. Examination of a Set of Actual Test Data
Which Illustrates the Problems Encountered When Using the
Standardized Residual as a Fit Statistic

Because of the inadequacies in the standardized residual fit statistic,
it was necessary to construct another procedure for testing the fit of the
Rasch model to actual data. The procedure I devised is as follows. Veldman's
(1978) version of the PROX procedure (Cohen, 1976, in Wright, 1977) was used
to estimate the item difficulties and person abilities for each of the 1664
students who responded to a 237 item English achievement test. All 1664
students had scores greater than 0 and less than 237, enabling them to have their
abilities estimated according to the model. Fit of each item to the Rasch
curve was assessed by fitting a least squares line to the data in the range of
±2 units of the estimated ability-difficulty (b-d). A t-test was used to
determine the probability that the data were sampled from a population in which
the slope of the line was the same as the slope of the Rasch model between these
two values.
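
The slope test described above might be sketched as follows. This is not the author's program: it is the textbook t-test for a regression slope against a fixed value, applied to one item's (b-d, score) pairs inside the ±2 range. The function name, its arguments, and the default reference slope of .2094 (the least squares slope of the Rasch curve in this range, as reported in the paper) are assumptions made for illustration.

```python
import math

RASCH_SLOPE = 0.2094  # reference slope of the Rasch curve within +/-2 (from the paper)


def slope_t_test(xs, ys, null_slope=RASCH_SLOPE, half_range=2.0):
    """Least squares slope of ys on xs within |x| <= half_range, with a t-test.

    xs are b - d values (person ability minus item difficulty);
    ys are item scores (1 = correct, 0 = incorrect).
    Returns (slope, t, degrees_of_freedom).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if abs(x) <= half_range]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

    slope = ss_xy / ss_x
    sse = ss_y - slope * ss_xy               # residual sum of squares
    s = math.sqrt(sse / (n - 2))             # residual standard deviation
    t = (slope - null_slope) / (s / math.sqrt(ss_x))
    return slope, t, n - 2


if __name__ == "__main__":
    # Tiny made-up data set, for illustration only.
    xs = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
    ys = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]
    print(slope_t_test(xs, ys))
```

A positive t indicates an item whose empirical curve is steeper than the Rasch curve in this range; a negative t indicates a flatter one.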

Figure 3 shows the Rasch item characteristic curve between the values ±2.0
($\beta - \delta$) and a line fitted by least squares to a uniformly distributed set of
data points on the Rasch curve. The slope of this line is .2094, and seems to
approximate the Rasch curve quite nicely. Table 5 shows the formulae for the
t-test which determined the significance of the difference between the Rasch
slope (.2094) and the slope of each item's data in this range.

I have taken some care to investigate the adequacy of the linear model as
a substitute for the Rasch curve in the range ±2 (b-d). The error introduced
was calculated by computing the sum of squared deviations from the Rasch curve
of data which conform exactly to the Rasch model. That is,

$$\sum \left\{ [P(c)]\,[1 - P(c)]^2 + [P(nc)]\,[0 - P(c)]^2 \right\},$$

where P(c) is the proportion of correct scores predicted by
      the Rasch model,
      1-P(c) is the distance from the Rasch curve to the correct score,
      P(nc) is the proportion of incorrect scores, and
      0-P(c) is the distance from the Rasch curve to the incorrect score.

Summing these values across 41 intervals (-2.0 to +2.0 in .1 increments,
see Figure 3), assuming a uniform distribution of data throughout this
interval, results in an error sum of squares (ESS) of 7.719602. Using the
linear model, the formula is

$$\sum \left\{ [P(c)]\,[1 - \hat{P}(c)]^2 + [P(nc)]\,[0 - \hat{P}(c)]^2 \right\},$$

where $\hat{P}(c)$ is the proportion of correct scores predicted by the linear
model. The error sum of squares in this case is 7.733142, or an error increase
of only .2%!

Table 5.
Formulae used to assess the significance of the difference
between the slope of actual item characteristic
curves and the slope of the Rasch curve.
[Several formulae are only partly legible in this copy; the standard least
squares forms consistent with the legible fragments are shown.]

$$x_i = b_i - d$$

$$SS_x = \sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}, \qquad
SS_y = \sum_{i=1}^{N} y_i^2 - \frac{\left(\sum_{i=1}^{N} y_i\right)^2}{N}$$

$$SS_{xy} = \sum_{i=1}^{N} x_i y_i - \frac{\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{N}$$

$$B_1 = SS_{xy}/SS_x$$

$$SSE = SS_y - B_1\, SS_{xy}$$

$$s = \sqrt{SSE/(N - 2)}$$

$$t = \frac{B_1 - .2094}{s/\sqrt{SS_x}}, \qquad df = N - 2$$

where $b_i$ = estimated ability of person i
      $d$ = estimated difficulty of item
      $y_i$ = 1 if person i answered item correctly
            = 0 otherwise
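
Both the least squares slope of the Rasch curve and the two error sums of squares can be checked numerically. The sketch below assumes, as the text indicates, 41 equally weighted levels of b-d from -2.0 to +2.0 in steps of .1; the intercept of .5 for the linear model is an assumption that follows from the symmetry of the curve about b-d = 0. Function names are introduced here for illustration.

```python
import math

# 41 levels of b - d from -2.0 to +2.0 in .1 increments.
LEVELS = [round(-2.0 + 0.1 * k, 1) for k in range(41)]


def rasch_probability(x):
    return 1.0 / (1.0 + math.exp(-x))


def least_squares_slope(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return ss_xy / ss_x


def error_sum_of_squares(predict, levels=LEVELS):
    """Sum over levels of P(c)(1 - fitted)^2 + P(nc)(0 - fitted)^2.

    The data are taken to conform exactly to the Rasch curve; `predict`
    supplies the fitted proportion correct of the model being evaluated.
    """
    total = 0.0
    for x in levels:
        p = rasch_probability(x)
        fitted = predict(x)
        total += p * (1.0 - fitted) ** 2 + (1.0 - p) * (0.0 - fitted) ** 2
    return total


slope = least_squares_slope(LEVELS, [rasch_probability(x) for x in LEVELS])
ess_rasch = error_sum_of_squares(rasch_probability)
ess_linear = error_sum_of_squares(lambda x: 0.5 + slope * x)

if __name__ == "__main__":
    print(round(slope, 4), round(ess_rasch, 4), round(ess_linear, 4))
    print(round(100.0 * (ess_linear - ess_rasch) / ess_rasch, 2), "percent increase")
```

Because the data are assumed to lie exactly on the Rasch curve, the Rasch error sum of squares is pure binomial variance, and the small excess for the linear model is the sum of squared gaps between the line and the curve.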
If one were to test for the adequacy of the linear model using an F-test
comparing the two models (see Ward and Jennings, 1973),

$$Y = a_1 X_1 + a_2 X_2 + \cdots + a_{41} X_{41} + E_1 \qquad (1)$$

where the $X_j$ are binary (0,1) vectors specifying membership in one of the
41 levels of b-d and Y is a vector containing observed scores (1 = correct,
0 = incorrect), and

$$Y = b_0 U + b_1 X + E_2 \qquad (2)$$

where U is a unit vector (all 1's), and X contains the value of b-d for each
of the observed scores in Y,

the F would be:

$$F = \frac{(ESS_2 - ESS_1)/(41 - 2)}{ESS_1/(41N - 41)}$$

where N is the number of data points at each level of b-d. In order to obtain
a significant (p = .05) F with 39 and 41N-41 degrees of freedom (F = 1.51),
approximately 33,841 data points are necessary!
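
The dependence of F on the number of data points can be sketched directly from the two error sums of squares. The code below assumes that data conform exactly to the Rasch curve, so that both error sums of squares scale in proportion to N, the number of data points at each level; the constant and function names are introduced here for illustration.

```python
ESS_RASCH_PER_POINT = 7.719602    # model (1), per data point at each level (from the text)
ESS_LINEAR_PER_POINT = 7.733142   # model (2), per data point at each level (from the text)
N_LEVELS = 41


def f_ratio(n_per_level):
    """F comparing the linear model (2) with the 41-level model (1)."""
    ess_1 = ESS_RASCH_PER_POINT * n_per_level
    ess_2 = ESS_LINEAR_PER_POINT * n_per_level
    numerator = (ess_2 - ess_1) / (N_LEVELS - 2)
    denominator = ess_1 / (N_LEVELS * n_per_level - N_LEVELS)
    return numerator / denominator


if __name__ == "__main__":
    for n in (100, 400, 825, 1600):
        print(n, N_LEVELS * n, round(f_ratio(n), 3))
```

F grows roughly in proportion to N, so the lack of fit of the straight line becomes detectable only with tens of thousands of data points.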

For those who like to think in terms of multiple correlation coefficients
($R^2$), these are

$$R^2 = 1 - \frac{(7.7196)(N)}{(41)(N)(.25)} \approx .247$$

for the Rasch model and

$$R^2 = 1 - \frac{(7.7331)(N)}{(41)(N)(.25)} \approx .246$$

for the linear model.
A Monte Carlo test of these calculations was performed. This involved
creation of 33,825 data points which conformed exactly to the Rasch curve (825
in each of the 41 levels of b-d), and analysis of these data using the two
models specified above using Veldman's PRIME subroutine REGRAN. The computed
$R^2$ were .2470 and .2457. The F was 1.513. Thus, for all practical purposes
the linear model seems to be an acceptable approximation of the Rasch curve
in this range.
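
A check of this kind can be approximated without a regression package. The sketch below is not the REGRAN analysis: it builds 825 scores at each of the 41 levels, with the number correct at each level rounded to the nearest whole person so that the data follow the Rasch curve as closely as integer counts allow, and then compares the 41-level model with the straight line. All names are assumptions introduced here, and the rounding means the results will differ slightly from the figures reported in the paper.

```python
import math

N_PER_LEVEL = 825
LEVELS = [round(-2.0 + 0.1 * k, 1) for k in range(41)]


def rasch_probability(x):
    return 1.0 / (1.0 + math.exp(-x))


def compare_models(n_per_level=N_PER_LEVEL, levels=LEVELS):
    """R^2 for the 41-level model and the linear model, and the F comparing them."""
    correct = [round(n_per_level * rasch_probability(x)) for x in levels]
    n_total = n_per_level * len(levels)
    mean_y = sum(correct) / n_total
    mean_x = sum(levels) / len(levels)

    # Total sum of squares of the 0/1 scores about their grand mean.
    ss_total = sum(c * (1.0 - mean_y) ** 2 + (n_per_level - c) * mean_y ** 2
                   for c in correct)

    # Model (1): one fitted value per level (the observed proportion correct).
    ess_full = sum(c * (1.0 - c / n_per_level) ** 2
                   + (n_per_level - c) * (c / n_per_level) ** 2
                   for c in correct)

    # Model (2): least squares straight line in b - d.
    ss_x = n_per_level * sum((x - mean_x) ** 2 for x in levels)
    ss_xy = sum((x - mean_x) * (c - n_per_level * mean_y)
                for x, c in zip(levels, correct))
    slope = ss_xy / ss_x
    ess_linear = ss_total - slope * ss_xy

    df_numerator = len(levels) - 2
    df_denominator = n_total - len(levels)
    f = ((ess_linear - ess_full) / df_numerator) / (ess_full / df_denominator)
    return {
        "slope": slope,
        "r2_full": 1.0 - ess_full / ss_total,
        "r2_linear": 1.0 - ess_linear / ss_total,
        "f": f,
    }


if __name__ == "__main__":
    for key, value in compare_models().items():
        print(key, round(value, 4))
```

The two $R^2$ values differ only in the third decimal place, which is the practical sense in which the line is an acceptable stand-in for the curve over this range.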

Table 6 shows the item numbers, standardized mean square residuals,
proportion of the sample obtaining correct responses, Pearson product moment
correlations of item-total test scores, least squares slope of the actual item in
the ±2 b-d range, the value of the t statistic testing for slope of .2094, and
the number of cases on which the t was calculated (i.e., the number of data
points in the ±2 b-d range) for a selected set of items.

Table 6 is ordered according to decreasing slope of the least squares
lines. Some interesting relationships among the statistics in this table are
apparent. Items with steep slopes tend to have low residual mean squares and
also have high item-test correlations (CORR). In general, when the difficulty
of an item is about .5 (50% correct) and the slope was high (.24+), the
item-total correlation was .45 or greater, and the slope was significantly greater
than the Rasch model. According to the slope test, only about one-third of


ERIC 



22 



. Item # 


Heaxi 
Square 
Residual 


Prdpdrtibi 
Correct 


1 Rasch 
Difficult] 


Cbrre- 
f iation 


Slope 


t 


Test-N 








D m J J 


. UU 




.95 


Z7 


48 


.75 


.98 


-3; 38 




.48 


1.73 


32 


72 


• 62 


.97 


-2.78 




.39 


2.96 


65 • 


53 


. '69 




-2.94 


.35 


.35 • 


2.05 


52 












138* ^ 


.74 


.33 


2.24 


.51 


.36 


10.45 


142S 


31 


.62 


.95 ' 


^ -2.16 


.39 


.35 


3.45 


141 


136* 


.70 


.28 


2.50 


_48 


'':35 


8.98 


issi 


29 


.95 


.97 


-2.88 


.33 


.35 


1.91 


57 - 


10 


• 54- 


.84 


-.64 


.61; " 


.34 


7.11 


759 


206 ^ 


.64 . 


.87 


' -.88 


.56 


.34 


6.16 


612 


17* 


.73 


;38 


1.97 


.53 . 


.34 


9.63 


1495 


18* 


.74 


.45 


1.61 


.57 - 


.34 - 


10.58 


1558 


107 


.66 


.12 


3.75 


.33 


.33 


3.62 


659 


34 


.97 


.97" 


-2.68 


.30 


^ .33 


2.13 


71 


133* 


.73 


.65 . 


• 64 


.60 


.33 


10.07 


1480 


96* 


.76 


.37 


2.02 


.50 


.32 


8.42- 


1484 


98 


.63 


.89 


-1 .14 


.55 


••32 


4.67 


481 


97 


*64 


.90 


-1.23 


.54 


.32 


4.23 


411 


202* 


• 74 


.65 


.66 


.59 


.32 


9.33 


1479 


12* ^ 


.78 


.48 


1.48 


.54 


.32 


9.07 


1562 


232* ^ 


.70 ' 


^ . .70 


' .36 


.60 


.32 


^ 8.39 


1370 


^ 131 


.62 


;84 


-.63 


.59 


' .32 


5.59 


759 


226 


.59 


.82 


-.43 . 


.61 


.32 


6.09 


"878 


207 


.68 


.87 


-.92^ 


.54 


.32 


4.97 


599 


208 


.57 


.80 


-.24 


.60 


.32 


6.62 


993 


59 


.40 


.97 


-2.97 


.38 


* .31 


1.34 


52- ■ 


14* 


.65 " 


^ .77 


-.04 


.60 ^ 


.31 


7.15 


■ 1130 


' 215* 


.85 


.51 


1.33 


.53 


.31 


8.64 


^1566 


205 


.59 


. .87^ 




.57 


.31 


4.67 


599 


66 


.78 


.96 


-2.4.0 ' 


^ .34- 


.31 


1.91,' 


- 97 


13* 


.79 


.57 


i.03 


.56 


.31 


8.07 


155(5 


75 


.42 

f 

1 

l: 


.97 1 

' i_ 


;.-2.94 


.42 


.31 


1.20 


52 






iten 


.Heaih • 
Square 
Residual 


"X 

Proport:^ 
Correct 


ri Rasch 
Difficult 


Cbrre- 
y lation 


Slope 


t 


i 

fest-N 


5* , 

90* 
' 77 

15* 
152* 
• 16* 
197* 
166* 
3* 
101 
236 
137 
■ ±35 

20 
228 

37 
227 
198 . 

92 

83* 
114* 

87* 
7* 
108* 

2g- ■ 
210* 

32 
165* 
192* 


,80 

..74 

.24 

.78 

.83 

.77 
i69 
. . 88 
.76 ' 
.67 
•75 
.84 
.85 
.77 
.71 
1.03 
.71 
.70 

.74 

.97 
1.07 
1.04 

.98 
1.11 

^56 
1.03 

.62 
1.01 

.98 


.50 

.77 

.98 

.67 

.38 

.78 

.75 . 

.54 

.68 

.80 

.68 

.49 

.44 

.24 

.75 

■ -57 
.74 
.74 

.94 
.62 
.58 
-.52 
.67 
..42 
.92 
.56- 

.91 y 

.79 / . 
.64 


1.38 
-;03 
-3.00 

.54 
1.99 
-.14 

.08 
1.17. 

.51 
-.27 
-1.34 
1.4l'. ' 
1.66 
2.76 

.05 
1.04 

.16 

.16 

(79 items 
omitted) 

-1.88 
.77 
.99 
1.31 
.53' 
1.74 
-1.59 

2.14^ 
-1.35 
-.23 
- .68 


.'55 
-.57 
.44 
'.57 
.47 
.55 
.59 
.53 
.58 
.57 
.56 
.51 
.49 
.41 
.57 
.52 
.58 
.58 

.41 
.46 
.40 

' »• .41 
.44 
.38 
.46 
.42 
.47 

. .43 
;45 


« 

.30 
.30 
.30 
.30 
^36 
' .30 
. 30 
.30 ^ 
.30 
■ 30 

.30 
.29 
.29 
.29 
.29 
.29 
.29 

.23 
' .23 
.22 
.22 

:22- 

.22 . 

;22 

.22 

.22 

.22 

.22 


7.78 
^ :6.29 
■ 1.14. 
7.28^ 
6.60 
5.95 
6.30 . 
7,13 
■■'6.75 
5.29 
5.73 
6.89 
6.66 
4.79' 
5.78 
6.81 
5.95 
- 5.91 

.44 
1.26 
1.17 
- 1.12 
i.03 
-.93 
■ .38 
.94 
' .38 
.58 
.65 


1563 

1130 ■ 
51 

1448 

1495 

1066 ■ 

1218 

1567 
1428 
972 
1428 
1564 
1551 
1249 
1200 
1556 
1260 
1260 

201 
1506 
±550 
1566 
.1.428 
1536 

285 
1564 

366 
1025 
1479 






Mean 
Squar^ 
Resi'duai 



'roportionj Rasch i. Corre- 
Gofrect Difficulty iatibh 



Slbtie 



Test-N 



: 229* 

62 
214 
231*. 

82 

79 

11 

19* 

27 
163* 

1 

ljs7* 
150 

42 
151 
217 

38 
172* 
■ 55" 
140 
190* 

47* 

58 
132* 
•157 
143* . 
159* 
168* 
153 



124* 
171* 



1-04 

• .87 
ii22 

■ . .95 
.87 
.69 
.80 
.x91 
-.66;' 
.1.13 

• .98 
. .76 

.24 - 

.78 
1.19 

.80 
1.22 . 

.33 

.57 
1.13 

.93 ,^ 
. 1.Q3 
"1.07 

.82 
i.40 
1.06 
1.35 

.94 



.51. 
.85 
.13 
.64 
.91 
.93 
.83 
^22 
i .94 
.75 
.67 ' 
-.94 
.98 
.89 
.96 
.92 
.27; 
. -98 
.94 
.75 
.74 
.92 
.56 
-88 
.47 
.72 
.27 
.82 



1.29 
1.17 



.72 
.53 



.87 
-.65 
3.62 
.67 
-1.41 
-1.64 
-.52 
2.9 
-1,96 
.09 
.51 
-1.93 
-3.42 
-1.11 
4.55 
-1.52 
2,. 59 
-3.03 
-1.88 
.09 
-.54 
-1.51' 
1.10 
-1.03 . 
-.01 
.28 
2.58 ' 
-.39 

1(15 items 
omitted) 

.24 
1.24 



-^3 
.44 
.24 
.^4 

.m 

.42 
.48 
.33 
.38 
.39 
.43 
.36 
.38 
.44 
? .18 . 

.4o: 

.26 

'.37 

.39. . 
1.04 

.32 
-»37 

.42 

.27 

.39 

.21 
••.41 



.29. 
.31 



.22 
:22 
.21 
.21 
.21 
.21 
.21 
\21 
.21 
.21 
.21 
.20 
.20 
.20- 
.20 
.19 
.19 
.-19 
'.19 
.19 
.19 
.19 ■ 
.18 
.18 
.18 
.18 
.18 
.17 



.15 
.15 



.63 
, -36' 
.18 
.'37 
.13 
.09 
.13 
-.05 
-.05 
-.23 
-.25 
--19 
-.06 
-.44 
-.18 
-.48 , 
-.98 
-.18 
-.47 
-1.25 
-1.55 
- --84,. 
-1.87 
-L.15 
-2.07 
-2.05 
-1.83 
-1.95 



-ii.l2 
-4,51 



1526 
759 
" 753 

1479 
343 

.266 
823 

1167 

- 182 
1218 
1428 

' 186 
31 

481 

185 

311 
' 1318 

- 44 J 
201 

1218 
1260 

311 
1564 

548 
1557 
.1317 
1318 

901 



1302 
1568 








         Mean
         Square     Proportion   Rasch        Corre-
Item #   Residual   Correct      Difficulty   lation   Slope      t      Test-N

147        .98        .83          -.54         .36      .11     -4.95      801
70*       2.08        .21          3.08         .06      .10     -5.56     1110
178*      2.01        .36          2.08         .06      .10     -7.20     1474
139*      1.55        .36          2.08         .16      .10     -7.32     1475
181       1.50        .79          -.21         .19      .09     -7.65     1025
17?*      1.62        .66           .57         .17      .08     -9.06     1448
25         .36        .98         -3.55         .31      .08     -1.00       30
185       3.51        .15          3.46         .01      .08     -5.73      855
169       1.30        .48          1.45         .19      .06    -10.02     1564
109       1.24        .16          3.39         .18      .06     -5.65      880
180       2.31        .33          2.21         .02      .05     -9.82     1445
177       1.45        .69           .44         .16      .05    -10.67   (illegible)
155       1.51        .70           .36         .15      .04    -11.26     1370
209       1.37        .22          2.89         .17      .03     -8.60     1167
52        1.51        .61           .87         .10      .02    -12.96     1526
184       2.24        .31          2.34         .01     -.00    -12.64     1395
160       1.10        .03          5.42         .14     -.02      -.97       22
73        1.77        .53          1.22        -.10     -.04    -17.18     1568
148       2.19        .41          1.82        -.01     -.06    -17.88     1521
146       2.13        .45          1.59        -.14     -.10    -21.75     1553

Table 6.

Statistics on selected achievement test items based on
a sample of 1664 elementary students.

*Items which were selected for further analyses; see Section IV.






the items (78/237) fit the model (|t| > 2.0 rejection value), and for many of
these (25) the test was based on less than 100 cases. Thus, it could be
argued that only 53 items fit the Rasch model. A common rule of thumb for
rejection of an item is RMSQ > 3.0 (Wright and Mead, 1977, p. 51), which would
lead to rejection of only 2 items!

Some items in Table 6 are clearly very poor test items from a traditional
point of view, and yet appear to be acceptable when the mean square residual
is examined. For example, item #73, near the bottom of the list, has a mean
square residual of 1.77 and an item-total correlation of -.01! Figure 4
shows the Rasch curve and a plot of item #73. It would be absurd to include
this item in a test, at least from any point of view but the mean square
residual. Table 7 shows the calculations of the RMSQ for item #73.

From the data in Table 6 an interesting correlation can be calculated.
The residual mean square correlates .50 with the Rasch difficulty estimate.
This seems strange: why should a fit statistic be biased toward easy items?
This degree of correlation between the mean square residual and the proportion
of correct responses is also seen in data which has been published. For the
parole data in Perline, Wright and Wainer (1977, p. 13), the correlation is .55.
The exam data in Rasch (1960, p. 106) gives rise to a .26 correlation, using
Rasch's estimates of the difficulty and mean square residuals calculated by myself.

It might be pointed out that correlations between statistics can be
misleading. For example, Perline, Wright and Wainer (1977) attempt to demonstrate
that Rasch ability estimates are essentially identical with ability estimates
obtained using additive conjoint measurement. The correlations between ability
estimates based on two data sets are .997 and .990. However, correlations
between Rasch ability estimates and the raw test scores to which they correspond
are .994 and .993 on these same data. From this point of view, there is no



Figure 4.

A comparison of the Rasch curve and one obtained
item characteristic curve which fits according
to the mean square residual but does not fit
using traditional test item statistics.
(Plot of P(c) against raw score, 85 to 215, for item #73:
RMSQ = 1.77, slope = -.04, r = -.01, N = 1568 within ±2 b-d,
t = -17.182, p < .001.)



                                                               Sum of
                                                               squared
 P(c)      N     N(c)   exp(-(b-d))   N - N(c)   exp(b-d)      residuals

  .00      1       0      180.33          1        .0055           .01
  .20      5       1       79.33          4        .0126         79.38
  .40      5       2       44.97          3        .02           90.00
  .38      8       3       28.81          5        .03           86.58
  .45     11       5       19.80          6        .05           99.30
  .40      5       2       14.24          3        .07           28.69
  .59     17      10       10.56          7        .09          106.23
  .58     19      11    (illegible)       8     (illegible)      88.93
  .40     25      10        6.13         15        .16           63.70
  .60     42      25        4.75         17        .21          122.32
  .69     54      37        3.70         17        .27          141.49
  .55     65      36        2.88         29     (illegible)     113.63
  .67     84      56        2.24         28        .45          138.04
  .57    122      70        1.73         52        .58          151.26
  .59    147      87        1.33         60        .75          160.71
  .54    181      98        1.00         83       1.00          180.00
  .51    195      99         .73         96       1.36          202.83
  .43    252     108    (illegible)     144       1.92          332.64
  .44    222      98         .35        124       2.84          386.46
  .58    156      90         .22         66       4.56          320.76
  .75     44      33         .12         11       8.50           97.46
 1.00      4       4         .04          0      22.23             .16

 Total  1664                                          Sum =    2991.78

 RMSQ = 1.80

Table 7.

Calculation of the fit statistic for item #73 in Figure 4,
using data grouped into 10 point raw score intervals.
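The arithmetic behind Table 7 can be sketched in a few lines of Python. Under the Rasch model the squared standardized residual of a correct response is exp(-(b-d)) and that of an incorrect response is exp(b-d), so each score group contributes N(c) times the first weight plus (N - N(c)) times the second. The function names below are illustrative, and dividing the total by the total N is an assumption about the final step, which is not legible in the scan.

```python
def group_contribution(n, n_correct, w_correct, w_incorrect):
    """Sum of squared standardized residuals for one raw-score group.

    w_correct   = exp(-(b - d)), squared residual of a correct response
    w_incorrect = exp(+(b - d)), squared residual of an incorrect response
    """
    return n_correct * w_correct + (n - n_correct) * w_incorrect


def rmsq(groups):
    """Mean square residual: total squared residuals over total N (assumed divisor)."""
    total = sum(group_contribution(*g) for g in groups)
    n_total = sum(g[0] for g in groups)
    return total / n_total


# Two rows as read from Table 7: (N, N(c), exp(-(b-d)), exp(b-d))
rows = [(5, 1, 79.33, 0.0126), (5, 2, 44.97, 0.02)]
print(round(group_contribution(*rows[0]), 2))  # contribution of the first group
print(round(group_contribution(*rows[1]), 2))  # contribution of the second group
```

The two group contributions reproduce the 79.38 and 90.00 printed in the table. On the same assumption, dividing the printed total of 2991.78 by the 1664 respondents gives roughly 1.80.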




difference between raw scores and Rasch ability estimates. The correlation
coefficient simply is not sensitive to the scale changes in the distance
between ability estimates which are introduced by Rasch ability estimates and
multidimensional scaling. This scaling is an essential attribute of latent
trait modeling because it allows for test-free person measurement, at least
in theory.

The practical significance of the .55 correlation can be seen in Table
8, which shows a cross-tabulation of mean square residuals with Rasch
difficulty estimates. Notice that virtually all the items which have difficulty
estimates of -1.0 or lower (very easy items) also have a mean square residual
(RMSQ) of .9 or less, which would indicate very good fit to the Rasch model.
Items of medium difficulty tend to have low RMSQ, while items of moderate to
high difficulty tend to have high RMSQ. These figures indicate that the fit
statistic is very sample dependent, which Wright and Mead (1977, p. 50) also
point out.

The reason for this relationship between the RMSQ and item difficulty is
that many items are answered correctly more often than the Rasch model predicts.
This results in low RMSQ for easy items and high RMSQ for difficult items.

How selective must one be in order to construct tests which can be used
in test-free person measurement? This is an empirical question that this
paper only begins to address. For example, Figure 5 shows close correspondence
between the responses to item #220 and the Rasch curve. The slope of this item
characteristic curve is significantly steeper than the Rasch curve, but it may
not be deviant enough to warrant rejecting the item. That is, test-free person
measurement may not be impaired by retaining item #220 as if it fit the model.
However, more careful item selection is necessary than can be done using the
mean square residual. For example, Figure 6 shows an item




                              Rasch difficulty estimate

 Mean square         -infinity       -1.0           1.0 to
 residual            to -1.0         to 1.0         +infinity

 .00 to .90             60             53             19
 .90 to 1.1              6             31             15
 1.1 to infinity         1             22             30

 N = 237
 Chi-square = 60.34
 p < .01

Table 8.

Cross-tabulation of the mean square residuals with
Rasch difficulty estimates.
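The chi-square statistic reported with Table 8 can be checked with a short Python sketch. The cell counts below are as read from the scan, with one assumption stated plainly: the top-left cell is taken to be 60 (the scan can also be read as 66), because 60 is the value for which the nine cells sum to the reported N of 237. The function name is illustrative.

```python
def chi_square(table):
    """Pearson chi-square for a two-way table of counts; returns (statistic, N)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


# Rows: RMSQ below .90, .90 to 1.1, above 1.1
# Columns: difficulty below -1.0, -1.0 to 1.0, above 1.0
table8 = [[60, 53, 19],
          [6, 31, 15],
          [1, 22, 30]]

stat, n = chi_square(table8)
print(n, round(stat, 2))
```

With these counts the statistic comes out very close to the 60.34 printed in the table, on 4 degrees of freedom.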



[Figure 5. Item #220 and the Rasch curve. RMSQ = .983]

which has an item characteristic curve much steeper than the Rasch curve.
Inclusion of items with widely varying slopes in a test violates the basic
assumption of Rasch modeling, that all items have the same discrimination
(Wright, 1977, p. 103).

Part IV. The Practical Implications of Failure to
Reject Items Which Do Not Fit the Rasch Model

From Table 5, it appears that the items on this test battery differ
greatly with respect to the slope of the item characteristic curves. Do
these differences actually affect the properties of tests which might be
built using subsets of the items? Are these differences great enough to
impair the use of such tests in a testing program based on Rasch model theory?

The slope of the item characteristic curve is one index of item discrimination,
and the Rasch model assumes all items in a test have the same discrimination.
If one were to base judgements of acceptability of the items on the mean square
standardized residuals, only a few of the items would be unacceptable. What
practical consequences might arise if items with slopes as different as are
found here are included in tests?

The Estimation of Item Difficulty and Person Ability

In order to answer these questions, four tests of 20 items each were
constructed. The first, or "Steep" test contained only items which had slopes of
.29 or greater based on a sample size of 1000 or more data points between ±2 b-d.
The second, or "Rasch" test contained only items which had slopes not
significantly different than .2594, based on 1000 or more data points between ±2 b-d.
The third, or "Flat" test contained only items which had slopes less than .15,
based on 1000 or more data points between ±2 b-d. A fourth, or "Mixed" test
was composed of seven items from the Steep test, six from the Rasch test, and
seven from the Flat test.
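The selection rules for the Steep, Rasch, and Flat tests can be written down as a small Python sketch. The thresholds (.29, .15, and 1000 data points) come from the paragraph above. Treating "not significantly different" as |t| < 2.0, where t tests the item slope against the Rasch slope, is an assumption borrowed from the rejection value mentioned earlier, and the function name is illustrative.

```python
MIN_N = 1000  # minimum number of data points between +/-2 b-d


def classify(slope, t, n):
    """Return 'Steep', 'Rasch', 'Flat', or None if the item qualifies for no test."""
    if n < MIN_N:
        return None          # slope estimate rests on too few cases
    if slope >= 0.29:
        return "Steep"
    if slope < 0.15:
        return "Flat"
    if abs(t) < 2.0:         # assumed criterion for "not significantly different"
        return "Rasch"
    return None


# (slope, t, Test-N) as read from Table 6
print(classify(0.33, 10.07, 1480))   # item 133
print(classify(0.10, -7.20, 1474))   # item 178
print(classify(0.11, -4.95, 801))    # item 147: too few cases
```

Items whose slopes fall between the bands and differ significantly from the Rasch slope end up in none of the three pools under this reading.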

A sample of 1480 respondents was selected from the 1664, each of which
had a score between 0 and 20 on each of the four tests. When each of these
tests was analyzed separately, very interesting results were obtained. In both
the Steep and Flat tests, the slopes of the item characteristic curves suddenly
seemed to be very close to the slope of the Rasch curve. The Steep test no
longer seemed to contain items with steep slopes and the Flat test no longer
appeared to contain items with flat slopes. The Rasch test still contained
items which fit the Rasch curve. Only the Mixed test seemed to contain items
with divergent slopes.

Table 9 shows the slopes of the 20 items which were in the Mixed test as
they were computed in two contexts: first, in the context of the Mixed test
and, secondly, in the context of the tests containing only items with similar
slopes. These statistics were based on the same sample of 1480 students.

These results were very disturbing. The Rasch model program apparently
adjusts the slope of the items in any given test to the best fit with the
Rasch curve. Looking at the output from program BICAL (Wright and Mead, 1977),
there is little if any indication that the Flat test is any better or worse
than the Rasch and Steep tests. This adjusting of discriminations is alluded
to in the manual for use of the BICAL computer program. The program provides
a "discrimination index," which "is in fact the linear trend across score
groups. Values larger than one indicate that the observed characteristic curve
for an item is steeper than the average best fitting logistic curve for all
items; values less than one indicate the curve is flatter" (Wright and Mead,
1977, p. 53).

However, there are very substantial differences in the quality of the 



                          Context

 Item #        Mixed Test      Homogeneous Test

 Steep
  133             .30               .22
   96             .27               .20
  202             .28               .22
   12             .30               .25
  232             .31               .22
   14             .31               .27
  215             .27               .19

 Rasch
  192             .21               .25
  229             .22               .20
  231             .21               .23
   19             .18               .15
  163             .19               .25
  167             .22               .25

 Flat
   71             .16               .32
   39             .18               .29
   33             .17               .26
  194             .13               .20
   36             .14               .21
  175             .14               .18
  149             .14               .18

Table 9.

Slopes of item characteristic curves for 20 items contained in
the Mixed test as computed in that context and in the context
of items with similar slopes.

Steep, Rasch, Flat, and Mixed tests. Table 10 shows some traditional test
statistics on these four tests. The alpha coefficients readily indicate that
the tests which contain items with steeper slopes have higher internal
reliability. Item-total correlations, shown in the lower portion of Table 10,
also indicate dramatic differences between the items in these tests. The
changes in slope observed in Table 9 between the mixed and homogeneous
context do not appear in Table 10. That is, the item-total correlations in Table
10 remain fairly constant regardless of the heterogeneity of the discriminations
of the items in the test.
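For readers who want to reproduce the alpha coefficients from raw item scores, the standard coefficient alpha formula can be sketched as follows. This is the textbook definition, not the routine used by the programs named in this paper, and the tiny response matrix is invented purely for illustration.

```python
def variance(values):
    """Population variance; the same divisor is used for items and totals."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def coefficient_alpha(scores):
    """Coefficient alpha for a persons-by-items matrix of item scores."""
    k = len(scores[0])                                  # number of items
    item_vars = [variance([person[i] for person in scores]) for i in range(k)]
    total_var = variance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of four persons to three items (1 = correct)
responses = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
print(coefficient_alpha(responses))
```

Because alpha rises with the average inter-item covariance, a test built from steep (highly discriminating) items is expected to show a higher alpha than one built from flat items of the same length.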

In order to verify these results, which were computed using Veldman's
PRIME library program RASCH, and to look for further information which might be
provided by Wright's program BICAL, similar runs were made using BICAL. Several
initial comparison runs verified that all but one of the statistics which were
computed by Veldman's RASCH program agreed to the third decimal with those
provided by Wright's BICAL when the PROX procedure was specified. The single
exception was the fit statistic, which was adjusted by very complex factors in
BICAL for reasons noted above. The BICAL program provides much more output,
however, and also allows for a supposedly more accurate calibration procedure,
referred to as UCON, for "unconditional maximum likelihood estimation." [The
"unconditional" refers to the notion that the exact probability of each response
vector given a particular total raw score is not computed in the process of
estimating item parameters. The unconditional procedure approximates the conditional
procedure, and is reported to be less expensive and less subject to round off
errors (Wright and Mead, 1977, p. 23).]

It was apparent from the output of these analyses that the procedure for
adjusting the average slope of the item characteristic curves to that specified
by the Rasch model (regardless of the data fed into the program) involves

                Steep       Rasch       Flat        Mixed

 Mean           31.68       31.45       31.95       32.59
 Sigma           5.57        4.13        3.17        4.32
 Alpha            .91         .78         .6?         .81


 Item-Total Correlations Within Different Contexts

 Item #        20 Item Mixed       20 Item Homogeneous
               Test                Test

 Steep
  133             .63                  .64
   96             .54                  .57
  202             .60                  .64
   12             .59                  .64
  232             .63                  .63
   14             .61                  .65
  215             .56                  .59

 Rasch
  192             .49                  .50
  229             .48                  .43
  231             .48                  .49
   19             .38                  .37
  163             .42                  .46
  167             .45                  .49

 Flat
   71             .38                  .47
   39             .40                  .45
   33             .37                  .41
  194             .37                  .35
   36             .34                  .36
  175             .25                  .26
  149             .3?                  .30

Table 10.

Some traditional test statistics for the items in the 20-item
Mixed test computed in that context and the context of the homogeneous tests.



adjustments of the estimated abilities of the students and difficulties of the
items. Recall that Rasch model item characteristic curves use ability minus
difficulty (b-d) as the abscissa (X axis). Even minor transformations of the
values of the abscissa can easily affect the calculated slope of a line. To
reduce the slope of a line by one-half, simply double the values of the abscissa.
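The point about rescaling the abscissa can be demonstrated numerically. The sketch below fits an ordinary least squares line to a handful of made-up (b-d, P(c)) points and then refits after doubling the b-d values; the data and function name are hypothetical and serve only to illustrate the scale dependence of the slope.

```python
def ls_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical proportions correct at five values of ability minus difficulty
b_minus_d = [-2, -1, 0, 1, 2]
p_correct = [0.1, 0.3, 0.5, 0.7, 0.9]

original = ls_slope(b_minus_d, p_correct)
stretched = ls_slope([2 * x for x in b_minus_d], p_correct)
print(original, stretched)   # the second slope is half the first
```

The responses themselves are untouched; only the spacing of the ability and difficulty estimates changes, yet the apparent slope is halved.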

The procedure used to adjust the slope of an item characteristic curve to
that of the Rasch model does this by transforming the scale of the ability and
difficulty estimates. For example, the range of ability in the sample of 1480
students was estimated to be 7.43 units based on the Steep test and 6.51 units
based on the Mixed test. Item difficulty estimates also vary systematically.
The distance between item #96 (the most difficult item common to the Steep and
Mixed tests) and item #14 (the easiest item in common) was estimated to be 2.66
units when difficulty estimates were obtained in the 20-item Steep test and 2.?2
units when based on the 20-item Mixed test, using the same sample of 1480
students' data. The difference in the scale of ability estimates is combined with
the difference in the scale of difficulty estimates when the ability minus
difficulty (b-d) scale is formed, to produce a change in the apparent slope of
an item characteristic curve. Notice in Table 9 that the slope of the item
characteristic curve for each item in the Steep test is less when based on the
homogeneous (Steep) test than when based on the Mixed test.

It has been stated that "item discrimination cannot be estimated directly
or efficiently in the way Rasch item difficulty and person ability can" (Wright,
1977, p. 104). Wright criticized Lord (1975) for imposing arbitrary constraints
on his procedure for estimating discrimination, but it seems that Wright's
procedure also imposes a very rigid constraint: that the "average best fitting
logistic curve for all items" (Wright and Mead, 1977, p. 53) must be the same
as the Rasch model. The Rasch model does not have a parameter for item
discrimination; this does not mean that the Rasch model does not specify what
the discrimination is. It merely asserts that every item has the same
discrimination. In order to demonstrate that the items in a particular test fit
the Rasch model, it is necessary to show that they have the particular
discrimination specified by that model. This is done by adjusting the ability
and difficulty estimates and using these as the abscissa when plotting the
item characteristic curves.

Person-Free Test Calibration

What effect do variations in the slope of the item characteristic curve
have on person-free test calibration? In order to investigate this, three
ability groups were defined out of the sample of 1480. Wright and Mead (1977,
p. 46) suggest that extremely high and low scores should not be included in
item difficulty calibration attempts. Therefore, students with more than 49
or less than 17 correct responses to the 60 items in the Steep, Rasch, and
Flat tests were excluded. The low ability group contained 494 students with
scores of 17 through 31. The high ability group contained 494 students with
scores of 40 through 49. A medium ability group contained 367 students with
scores of 32 through 39. A total of 1355 students thus remained, divided into
these three ability groups.
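The grouping rule just described is simple enough to state as code. The cut points (17, 31, 32, 39, 40, 49) are taken from the paragraph above; the function names and the short list of scores are illustrative only.

```python
from collections import Counter


def ability_group(score):
    """Assign a 60-item raw score to an ability group, or None if excluded."""
    if 17 <= score <= 31:
        return "low"
    if 32 <= score <= 39:
        return "medium"
    if 40 <= score <= 49:
        return "high"
    return None              # fewer than 17 or more than 49 correct


def group_sizes(scores):
    """Count retained students per group, dropping the excluded scores."""
    groups = (ability_group(s) for s in scores)
    return Counter(g for g in groups if g is not None)


# Hypothetical raw scores
print(group_sizes([12, 17, 31, 32, 39, 40, 49, 50, 55]))
```

Applied to the actual sample, this rule yields the 494 low, 367 medium, and 494 high ability students reported above, 1355 in all.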

A new Mixed test was also defined, containing 10 items from the Steep
test and 10 from the Flat test. This Mixed test was composed of items which
had very nearly the same level of difficulty in order to illustrate as clearly
as possible the potential hazards of including items with different
discriminations in a single test. Between 25% and 68% of the 1355 students marked each
of these items correctly.

Difficulty estimates for the 20 items in this Mixed test were calculated
using the high and low ability groups. Table 11 shows the estimated difficulties



              Difficulty Estimates

                  Ability Group

 Item #        High          Low

   17          -.45           .23
   18           .37          1.58
  133          -.08          1.18
  202         -1.75          -.16
   12         -1.45          -.39
  215          -.04           .79
   13          -.14           .39
    5          -.74           .17
  152          -.24           .64
  166           .70          1.16
  171           .33          -.33
  173           .38          -.48
   71          -.26          -.74
  194           .16          -.78
  175          -.14         -1.59
  161          -.61         -1.03
   65           .39          -.44
   84           .68          -.44
  178          1.53           .05
  139          1.37           .18

Table 11.

Difficulty estimates for a set of 20 items based on
high and low ability groups.



of each item based on two ability groups. These difficulty estimates were
computed using Wright and Mead's program BICAL. The correlation between
difficulty estimates based on the high and low ability groups is -.42. "If gross
variation in item discrimination is tolerated in the final pool of test items,
then the possibility of person-free test calibration is lost" (Wright, 1968, p. 106).

Table 12 shows the fit statistics for each of these 20 items as reported
by program BICAL for the high and low ability groups and the total sample.
Notice that only the "Between Group Fit Mean Square" based on the total sample
gives any indication that these items do not fit the Rasch model. According
to Wright and Mead, "the between group mean square tests the agreement between
the observed item characteristic curve and the best fitting Rasch characteristic
curve as estimated by the groups selected" (1977, p. 52). A non-significant
between groups mean square "indicates that statistically equivalent estimates
of difficulty would result from using either the low scorers or the high scorers
for calibration" (Wright and Mead, 1977, p. 51). Even so, they do not recommend
dropping items from a test if they have high between group fit mean squares.
This statistic is apparently the only indication on the BICAL printouts that
there may be something amiss with the Mixed test. Even so, it only indicates
this when these items have been calibrated on a sample containing a wide
distribution of ability. When the calibrations and fit statistics are based only
on the high or low ability groups, there is no way to tell from the BICAL
printouts that the difficulty estimates would not be stable across ability groups.

The reason these difficulty estimates are so different when based on the
high and low ability groups is that the item characteristic curves cross.
Items from the Flat test are "more difficult" within the low ability group,
but "less difficult" within the high ability group. Rasch model proponents
have clearly recognized that items which have crossing characteristic curves



              Between Group Fit                   Total Fit
              Mean Squares                        Mean Squares

 Item #     High      Low       Total         High      Low       Total

   17       1.37      1.08       7.41         1.03       .95       .92
   18       2.35      2.75      15.72         1.02       .82       .83
  133       1.58      2.15      20.69          .99       .84       .82
  202       1.68      3.67      15.55          .84       .99       .82
   12        .33      1.81       9.62          .98      1.01       .87
  215        .90      3.50      14.89          .98       .85       .86
   13       2.39      2.94       4.48          .95      1.01       .94
    5       1.82      2.43      13.86          .89       .94       .85
  152        .97      2.16      10.30         1.02       .91       .90
  166        .43      1.57       3.50         1.07       .86       .93
  171       2.02      1.12       2.13          .98      1.01      1.08
  173        .86      1.49       6.87         1.08   (illegible)  1.18
   71        .32      1.83   (illegible)       .99   (illegible)  1.10
  194        .35      1.34       6.80         1.05      1.08      1.18
  175       1.55      1.03      19.02         1.08      1.08      1.34
  161        .82      1.18       2.13          .98      1.12      1.09
   65        .15      2.65       7.18         1.05      1.13      1.19
   84       2.53       .67      10.42         1.05      1.08      1.22
  178       3.31      4.90      28.54         1.12      1.15      1.40
  139        .94      1.91      11.78         1.06      1.06      1.22

Table 12.

Residual mean square fit statistics for the 20 items in the
Mixed test computed by program BICAL using three
samples: high ability, low ability, and total group.

cannot be allowed to remain in a test if Rasch model procedures are to be
used (Wright, 1968). What I have demonstrated is that the standardized
residual mean square fit statistic does not provide the information necessary to
select items with similar slopes.

Vertical Equating 

These differences in difficulty estimates affect the results of attempts
to link tests, also. For example, no items are common to both the Steep and
Flat tests, but they can be linked through the Mixed test. Ten items in the
Mixed test are common to the Flat test and ten to the Steep test. Following
the procedure specified by Wright (1977, p. 107), we can link the Flat and
Steep tests using the calibration of the Mixed test.

The constant necessary to translate all item difficulties in the
calibration of the Mixed test onto the scale of the Flat test would be

     $t_{mf} = \frac{1}{10} \sum_{i=1}^{10} (d_{if} - d_{im})$,

where $d_{if}$ and $d_{im}$ are the difficulties of linking item $i$ as
estimated in the Flat and Mixed tests.

The constant of the translation between the Steep test and the Mixed
test would be

     $t_{sm} = \frac{1}{10} \sum_{i=1}^{10} (d_{im} - d_{is})$,

where $d_{is}$ is the difficulty of linking item $i$ as estimated in the
Steep test.

Table 13 shows the difficulties estimated for the linking items on
these three tests, with the high and low ability estimates for the Mixed test.
Using the difficulty estimates based on the low ability group, the two
constants are

     $t_{mf} = .830$ and

     $t_{sm} = .404$.
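The translation constant is simply the mean difference between two calibrations of the same linking items, which the following Python sketch makes explicit. The difficulty values are the Flat test and low ability Mixed test columns as read from the scan of Table 13; a few digits are uncertain, so the result should only be expected to agree with the reported constant to about two decimals. The function name is illustrative.

```python
def translation_constant(target, source):
    """Mean of (target - source) difficulties over the linking items."""
    return sum(t - s for t, s in zip(target, source)) / len(target)


# Ten Flat-test linking items (171, 173, 71, 194, 175, 161, 65, 84, 178, 139),
# values as read from Table 13
flat = [0.306, 0.312, -0.074, 0.026, -0.660, -0.450, 0.377, 0.476, 1.225, 1.160]
mixed_low = [-0.313, -0.478, -0.755, -0.785, -1.593, -1.034, -0.456, -0.444,
             0.055, 0.182]

t_mf = translation_constant(flat, mixed_low)
print(round(t_mf, 2))   # approximately .83
```

Adding this constant to any difficulty from the low ability Mixed calibration places it on the Flat test scale.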

Using the difficulty estimates based on the high ability group, the



                                    Mixed
 Item #        Flat         low          high         Steep
 N             1355        (494)        (494)         1355

   17                       .233        -.453         1.199
   18                      1.580         .371          .735
  133                      1.180        -.077          .526
  202                      -.158       -1.750         -.549
   12                      -.393       -1.451         -.590
  215                       .787        -.035          .393
   13                   (illegible)     -.140         -.058
    5                       .172        -.745          .389
  152                       .642        -.23?         1.238
  166                      1.162         .696          .159
  171          .306        -.313         .327
  173          .312        -.478         .380
   71         -.074        -.755        -.261
  194          .026        -.785         .161
  175         -.660       -1.593        -.140
  161         -.450       -1.034        -.613
   65          .377        -.456         .389
   84          .476        -.444         .680
  178         1.225         .055        1.533
  139         1.160         .182        1.368

Table 13.

Linking the Flat and Steep tests through the Mixed test, using either
high or low ability groups to calibrate the difficulties
of items in the Mixed test.



two constants are

     $t_{mf} = -.117$ and

     $t_{sm} = -.739$.



Thus, the difference between the difficulties of the Flat and Steep tests is
estimated to be 1.234 units using the link based on low ability students and
-.856 units using the link based on the high ability students. The average
scores of the 1355 students on these tests were 11.19 (Steep test) and 12.02
(Flat test), indicating that the Flat test is slightly easier than the Steep
test.

The ability estimate for a person with a score of 11 on the Steep test
is .24 according to the BICAL printout, based on the 1355 students in the sample
and the 20 items in the test. However, translating to the scale of the Flat
test, a score of 11 on the Steep test would reflect an ability of 1.49
using the low ability link and -.60 using the high ability link. Such a
great difference (2.09 units) would be intolerable in a testing situation.
(In terms of a more familiar metric, -.60 is roughly equivalent to the 25th
percentile and 1.49 to the 76th percentile on the Steep test.)
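The size of the discrepancy follows directly from the four link constants reported above, as this short sketch shows. Chaining the Steep to Mixed and Mixed to Flat constants by simple addition is the reading assumed here, and the function name is illustrative.

```python
def steep_to_flat_shift(t_sm, t_mf):
    """Total shift from the Steep scale to the Flat scale via the Mixed test."""
    return t_sm + t_mf


low_link = steep_to_flat_shift(0.404, 0.830)      # constants from the low ability group
high_link = steep_to_flat_shift(-0.739, -0.117)   # constants from the high ability group
gap = low_link - high_link

print(round(low_link, 3), round(high_link, 3), round(gap, 2))
```

The two shifts are 1.234 and -.856, so any Steep test ability estimate lands 2.09 units apart on the Flat scale depending on which ability group calibrated the Mixed test. The gap does not depend on the person's score.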

To my knowledge, no other study has reported such unacceptable results
using the Rasch model. The selection of items for this test of sample-free item
calibration and vertical equating was intended to produce as dramatic results as
possible. The fact that many other studies have obtained at least marginally
acceptable results using the Rasch model indicates that the Rasch model is
fairly robust. That is, even though variation in item discrimination can have
dramatic, adverse effects on item calibration and test linking, in many situations
such effects have not been observed. Future studies need to monitor item
discriminations more closely and thus determine the extent to which the assumption
of equal discrimination can be violated in practice.

Part V. Conclusions and Suggestions for Further Research 

This investigation has demonstrated two things. First, the discriminations
of all the items in a test must be very similar in order for Rasch model analyses
to work in practice. Second, the standardized residual mean square fit
statistic does not detect unacceptable variation in discrimination. These findings
were used to construct tests in which item discriminations were dissimilar
enough to produce discrepancies in difficulty estimates based on high and low
ability groups. These discrepancies were then shown to lead to serious errors
in test linking. The major implication is that item discrimination needs to
be monitored and controlled using more exact tests of fit than the residual
mean square. If item discrimination is carefully controlled, then vertical
equating, sample-free test calibration and test-free person measurement might
be possible.

The successful use of the linear model to determine the slope of item
characteristic curves may be indicative of its potential as a latent trait
model. If one were to restrict the range of specification of the model to,
say, .1 < P(c) < .9, then the linear model is practically equivalent to the
logistic model (see Figure 4).
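
The closeness of the two models over the restricted range can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a logistic curve with unit scaling, fixes the intercept of the line at .5, and uses an arbitrary evenly spaced grid.

```python
import math

def logistic(x):
    """Rasch (one-parameter logistic) probability of a correct response,
    where x is ability minus difficulty in logits."""
    return 1.0 / (1.0 + math.exp(-x))

# Ability-minus-difficulty values at which the logistic curve equals .1 and .9.
x_hi = math.log(0.9 / 0.1)
x_lo = -x_hi

# Evenly spaced grid over the restricted range (the grid size is arbitrary).
n = 201
xs = [x_lo + (x_hi - x_lo) * i / (n - 1) for i in range(n)]
ps = [logistic(x) for x in xs]

# Least-squares slope for the line P(c) = m*x + .5 (intercept fixed at .5).
m = sum(x * (p - 0.5) for x, p in zip(xs, ps)) / sum(x * x for x in xs)

# Largest absolute gap between the line and the logistic curve on the grid.
max_gap = max(abs(m * x + 0.5 - p) for x, p in zip(xs, ps))
print(f"slope m = {m:.3f}, largest gap in P(c) = {max_gap:.3f}")
```

The printed gap is the worst-case difference in P(c) between the two curves inside the range; whether that difference is small enough is a judgment for the user.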

The linear model has two parameters: the slope (or discrimination) of
the item and the intercept (or difficulty) of the item. One might define the
difficulty of an item as the point at which the item characteristic curve
indicates a .5 probability of success. Previously in this paper, the linear
model was specified as P(c) = mx + b, where m is the slope of the line, x
equals ability minus difficulty and b equals .5, the expected value of P(c)
when ability minus difficulty equals zero. The model could be reformulated
by defining x as the ability of the person and b equal to (.5 - mδ), where m is
the slope of the line and δ is the difficulty of the item. Least squares
estimators of m and b are available which are unbiased, efficient, and
consistent. These estimators are very well suited to the problem at hand because
tests of hypotheses concerning m and b are very robust to violations of
assumptions of equal variance and normality of the distributions of P(c) given x
(George, 1977). These are exactly the type of distributions that occur in test
item data.
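
The reformulated model can be illustrated with an ordinary least-squares fit. This is a sketch under stated assumptions: the function names and the eight-person data set are hypothetical, and the difficulty is recovered by solving b = .5 - mδ for δ.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (m, b) for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def difficulty_from_fit(m, b):
    """Solve b = .5 - m*delta for the item difficulty delta."""
    return (0.5 - b) / m

# Hypothetical data: abilities of eight persons and their 0/1 responses to one item.
abilities = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
responses = [0, 0, 1, 0, 1, 1, 0, 1]

m, b = fit_line(abilities, responses)
delta = difficulty_from_fit(m, b)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}, difficulty = {delta:.3f}")
```

The difficulty returned is the ability at which the fitted line passes through P(c) = .5, which matches the definition of difficulty given above.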


Perhaps the major difficulty with estimating fit of items to any model is
that few models do well outside of the .1 < P(c) < .9 range. I suggest that
one does not need to model performance outside of this range. Indeed, a fully
consistent program of individualized testing would not use items which are too
easy or difficult to estimate ability. Nor would data from persons with
abilities too far above or below an item's difficulty be used to estimate that
difficulty. It is likely that, regardless of which model one wishes to use,
data outside the .1 to .9 P(c) range adversely affect the estimates of the
parameters of the model. Thus, it should be possible to devise algorithms
which improve the estimates of ability and difficulty and discrimination by
eliminating items and persons which do not provide reliable information.
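
One step of such an algorithm might look like the sketch below. It is an illustration only: the function names and the small data set are hypothetical, and it assumes provisional ability, difficulty and slope estimates are already in hand.

```python
def predicted_p(ability, difficulty, slope):
    """Linear-model probability of a correct response:
    P(c) = m*(ability - difficulty) + .5."""
    return slope * (ability - difficulty) + 0.5

def informative_pairs(abilities, items, low=0.1, high=0.9):
    """Return (person_index, item_index) pairs whose predicted P(c) lies strictly
    between `low` and `high`; all other pairs would be left out of re-estimation."""
    keep = []
    for p_idx, ability in enumerate(abilities):
        for i_idx, (difficulty, slope) in enumerate(items):
            if low < predicted_p(ability, difficulty, slope) < high:
                keep.append((p_idx, i_idx))
    return keep

# Hypothetical values: three persons and two items given as (difficulty, slope).
abilities = [-2.0, 0.0, 2.5]
items = [(0.0, 0.3), (2.0, 0.3)]
print(informative_pairs(abilities, items))
```

In practice the estimates would be recomputed from the retained pairs and the filter applied again until the retained set stops changing; that iteration is not shown here.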

One example of the use of the linear model would be in selection of a
set of items which satisfies the assumption of the Rasch model that all items
in a test have the same slope. If we define this to mean that no two items
in a test have item characteristic curves which cross in the .1 < P(c) < .9 range,
the linear model can be used to select such items. In order to determine whether
two item characteristic curves cross in this range, we specify their equations,
with $\beta$ denoting ability, to be
$P(c) = m_1\beta + .5 - m_1 d_1$ and $P(c) = m_2\beta + .5 - m_2 d_2$.
The point of intersection is

$$P_i = \frac{m_1 m_2 (d_1 - d_2)}{m_1 - m_2} + .5$$

where $d_1$ and $d_2$ are the estimated difficulties of the items. If $P_i$ is less
than .1 or greater than .9, then both items would be included in the test.
If $P_i$ is greater than .1 and less than .9, then the item with the steeper slope
should be retained and the item with the flatter slope dropped from the test.
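
The probability at which two item lines cross follows from setting the two linear equations equal. The steps below are a reconstruction from the two equations, with $\beta^{*}$ (a symbol introduced here) denoting the ability at which the lines meet.

```latex
\begin{align*}
m_1\beta + .5 - m_1 d_1 &= m_2\beta + .5 - m_2 d_2 \\
\beta^{*} &= \frac{m_1 d_1 - m_2 d_2}{m_1 - m_2} \qquad (m_1 \neq m_2) \\
P_i &= m_1(\beta^{*} - d_1) + .5
     = \frac{m_1 m_2 (d_1 - d_2)}{m_1 - m_2} + .5
\end{align*}
```

When $m_1 = m_2$ the lines are parallel and do not cross at all, so such a pair satisfies the non-intersection criterion.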

This selection procedure was applied to the 237 item test discussed
previously using Rasch difficulty estimates provided by the PROX procedure and
slope estimates based on the responses of individuals within ±2 (b-d) units
(see Table 6). Thirty items were retained, each of which did not cross any of
the others retained. Every other item in the test (i.e., 207 items) crossed
one or more of the retained items and had a flatter slope. Table 14 shows the
item numbers, slope and difficulty estimates of these 30 items. Other
statistics for most of these items are included in Table 6.
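
The selection rule can be sketched in code. The paper does not state the order in which pairs of items were compared, so the greedy pass from steepest to flattest slope used here is an assumption, and the item pool is entirely hypothetical.

```python
def crossing_probability(m1, d1, m2, d2):
    """P(c) at which the lines P = m*(ability - d) + .5 of two items intersect.
    Returns None when the slopes are equal (parallel lines never cross)."""
    if m1 == m2:
        return None
    return m1 * m2 * (d1 - d2) / (m1 - m2) + 0.5

def crosses_in_range(item_a, item_b, low=0.1, high=0.9):
    """True when the two item lines intersect at a probability strictly
    between low and high."""
    p = crossing_probability(item_a["slope"], item_a["difficulty"],
                             item_b["slope"], item_b["difficulty"])
    return p is not None and low < p < high

def select_items(items):
    """Greedy pass from steepest to flattest slope: keep an item only if it
    does not cross any already-retained item inside the .1 to .9 range."""
    retained = []
    for item in sorted(items, key=lambda it: it["slope"], reverse=True):
        if not any(crosses_in_range(item, kept) for kept in retained):
            retained.append(item)
    return retained

# Hypothetical item pool (names, slopes and difficulties are made up).
pool = [
    {"name": "A", "slope": 0.35, "difficulty": 0.0},
    {"name": "B", "slope": 0.20, "difficulty": 0.5},
    {"name": "C", "slope": 0.34, "difficulty": 2.0},
    {"name": "D", "slope": 0.25, "difficulty": -3.0},
]
print([it["name"] for it in select_items(pool)])
```

With this ordering, every dropped item crosses at least one retained item and has a flatter slope than it, which is the property reported for the 207 dropped items.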

This procedure should provide a very good method of building tests that
have all the properties desired by Rasch model proponents. The range of
difficulty estimated by the PROX procedure for the 30 items is -2.16 to 4.81 in
the context of the 237 items. It may simply not be necessary to use any other
items from the 237 to obtain as accurate an estimate of ability as is possible.
The alpha coefficient for this 30 item test is .92, based on the 1664 persons
described earlier. (Rasch model item difficulty estimates in the context of
the 30 item test range from -2.54 to 5.61, further illustrating the context
specific nature of these estimates.)

Another aspect of latent trait modeling that is very promising is the








Item      Slope of Linear Model      Rasch Difficulty Estimate

  ?               .342                       -.639
  ?               .320                       1.482
  ?               .306                       1.033
  ?               .315                       -.042
  ?               .301                        .539
  ?               .337                       1.974
  ?               .336                       1.608
  ?               .295                       2.764
  ?               .348                      -2.159
  ?               .286                      -1.907
  ?               .324                       2.020
  ?               .322                      -1.232
  ?               .323                      -1.13?
  ?               .333                       3.?55
  ?               .277                      -1.338
  ?               .290                        .223
  ?               .329                       -.644
  ?               .347                       2.502
  ?               .356                       2.241
  ?               .264                       4.813
  ?               .296                       1.172
  ?               .299                        .083
  ?               .293                        .157
  ?               .339                       -.886
  ?               .317                       -.918
  ?               .316                       -.244
 215              .314                       1.327
 225              .317                       -.434
 227              .293                        .157
 232              .318                        .356

Table 14.

Items retained by testing for non-intersection of item characteristic
curves in the range of .1 to .9 probability of correct response.






hope of determining when items and tests are biased for or against specific
individuals or groups of individuals. The Rasch model fit statistic discussed
earlier has been proposed as an index of such bias (Wright, Mead and Draba,
1976). However, it is clear from this paper how inadequate that statistic is.
The linear model might provide a practical means of detecting such bias by
letting x equal item difficulty instead of person ability. The slope (m)
might then be a good index of the extent to which each person brought the
target ability to bear on the items in the test. In addition, the standard error
of estimate can be calculated for each parameter in this model. This statistic
provides an index of the accuracy of the estimate of each person's ability.
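
A person-level fit of this kind might be computed as in the sketch below. It is an illustration under assumptions: the function name and the five-item data set are hypothetical, and the standard error shown is the usual simple-regression standard error of the slope with n - 2 degrees of freedom.

```python
import math

def person_fit(difficulties, responses):
    """Regress one person's 0/1 responses on item difficulty.
    Returns (slope, intercept, standard error of the slope)."""
    n = len(difficulties)
    mean_x = sum(difficulties) / n
    mean_y = sum(responses) / n
    sxx = sum((x - mean_x) ** 2 for x in difficulties)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(difficulties, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares; the residual variance uses n - 2 degrees of freedom.
    sse = sum((y - (slope * x + intercept)) ** 2
              for x, y in zip(difficulties, responses))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, se_slope

# Hypothetical person: succeeds on the three easiest items, fails the two hardest.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [1, 1, 1, 0, 0]
slope, intercept, se_slope = person_fit(difficulties, responses)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, SE(slope) = {se_slope:.3f}")
```

Because x is now item difficulty, a person who responds consistently with the trait yields a negative slope (success becomes less likely as items get harder); a slope near zero would suggest the responses bear little relation to item difficulty.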

Ultimately, the future of latent trait models depends upon the acceptance
they receive from those who do testing on a daily basis. It is important that
these people understand the models we propose and how they should be used.
Most practitioners have worked with linear models in other contexts. This is
one more reason I advocate a linear model over the logistic models currently
receiving the most intense analyses.



References



Anderson, J., Kearney, G. E., & Everett, A. V. An evaluation of Rasch's
structural model for test items. The British Journal of Mathematical and
Statistical Psychology, 1968, 21, 231-238.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.),
Educational Measurement (2nd ed.). Washington, D.C.: American Council
on Education, 1971, 508-600.

George, A. A. Effects of violations of assumptions on the power and robustness
of the F-test for equality of means using fixed effects linear regression.
Dissertation, University of Texas, 1977.

Lord, F. M. A theory of test scores. Psychometric Monograph, 1952, No. 7.

Lord, F. M. The relation of test score to the trait underlying the test.
Educational and Psychological Measurement, 1953, 13, 517-548.

Lord, F. M. Relative efficiency of number-right and formula scores. British
Journal of Mathematical and Statistical Psychology, 1975, 28, 46-50.

Panchapakesan, N. The simple logistic model and mental measurement. Doctoral
dissertation, University of Chicago, 1969.

Perline, R., Wright, B. D., & Wainer, H. The Rasch model as additive conjoint
measurement. Research Memorandum No. [illegible], Statistical Laboratory,
Department of Education, The University of Chicago, 1977.

Rasch, G. Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research, 1960.

Slinde, J. A., & Linn, R. L. An exploration of the adequacy of the Rasch
model for the problem of vertical equating. Journal of Educational
Measurement, 1978, 15, 23-35.

Tinsley, H. E. A., & Dawis, R. V. An investigation of the Rasch simple logistic
model: Sample free item and test calibration. Educational and
Psychological Measurement, 1975, 35, 325-339.

Veldman, D. J. The PRIME system: Computer programs for statistical analysis.
Austin: Research and Development Center for Teacher Education, The
University of Texas, 1978.

Ward, J. H., & Jennings, E. Introduction to linear models. Englewood Cliffs,
New Jersey: Prentice-Hall, 1973.




Whitely, S. E., & Dawis, R. V. The nature of objectivity with the Rasch model.
Journal of Educational Measurement, 1974, 11, 163-178.

Wright, B. D. Sample free test calibration and person measurement.
Proceedings of the 1967 invitational conference on testing problems. Princeton,
New Jersey: Educational Testing Service, 1968, 85-101.

Wright, B. D. Solving measurement problems with the Rasch model. Journal
of Educational Measurement, 1977, 14, 97-116.

Wright, B. D., & Mead, R. J. BICAL: Calibrating rating scales with the Rasch
model. Research Memorandum No. 23. Chicago: Statistical Laboratory,
Department of Education, The University of Chicago, 1976.

Wright, B. D., Mead, R. J., & Draba, R. E. Detecting and correcting test
item bias with a logistic response model. Research Memorandum No. 22,
Statistical Laboratory, Department of Education, The University of
Chicago, 1976.



Wright, B. D., & Panchapakesan, N. A. A procedure for sample free item
analysis. Educational and Psychological Measurement, 1969, 29, 23-48.






